A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing

Abstract Motivation Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover > 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. Methods We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. Results On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. Availability and implementation ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.


Supplementary Tables
Supplementary Table 1: Comparison of the precision, recall and 1 scores for variant calling on seven WGS datasets from the GIAB. The first two columns show the WGS sample and the number of benchmarking variants. Best precision, recall and 1 score for each sample are printed in bold.

Unknown paralog-specific copy number
It is possible that sample paralog-specific copy number is not fully known: for example in a three-copy duplication sample aggregate copy number was identified asˆ= 6 and paralogspecific copy number as = (2, ?, ?). In such cases we virtually combine repeat copies with unknown paralog-specific copy numbers into a new, extended repeat copy.
It is possible that several paralog-specific and several aggregate genotypes are referencecompatible if a PSV has different reference alleles within a single extended repeat copy.
Consequently, we discard informative PSVs, if one of the alleles appears in the reference sequences of all extended repeat copies.

Identifying possible locations for sequencing reads in LCRs
In order to identify possible sequencing read locations, we utilize the overlaps between reads, sequence variants, and paralogous sequence variants (PSVs). For each variant ParascopyVC stores information about its positions across all repeats copies. For a given read we evaluate the variants covered by the read, combine all variant positions across all repeat copies, and sort them by the genomic coordinate. Next, we select such clusters of variant coordinates that the distance between the first and last coordinate does not exceed the read length by more than 10 bp. Additionally, we combine two clusters of variant coordinates into one, if the distance between them is smaller than 50 bp, but assume that such cluster cannot represent a correct read location. Forbidden read locations receive 10 −20 location probability penalty.
In certain cases, we can either discard one of the possible locations, or be certain that the original location is correct. Consider a read with an original alignment to one of the repeat copies, that was remapped to another repeat copy to get a pooled alignment. We say that an alignment has unique tail if it contains at least 15 bp that do not overlap any entry in the homology table. If the original alignment has high mapping quality (≥ 50), has a unique tail, and has the same or fewer clipped basepairs than the pooled alignment (or if the original alignment matches the pooled alignment and has no clipping at all) -we say that the original alignment location is certainly correct. Finally, if the original alignment is much better than the pooled alignment (aligned length is at least 15 bp more) -we say that the pooled alignment is certainly incorrect.
We cannot easily confirm that the original location is correct, as it is possible that there exists a third repeat copy that the read can map to, and checking for such cases would significantly hamper the execution time.

Output files and quality scores
ParascopyVC generates two output variant call format (VCF) files: aggregate and paralog-specific.
In an aggregate VCF file, reference allele of a sequence variant or PSV is the reference sequence of the variant in the first repeat copy of the duplication. In case of PSVs, the reference alleles from all other repeat copies are stored as alternative alleles. For each sample, ParascopyVC provides the most probable aggregate genotypeˆand its Phred quality score, calculated as (ˆ) = −10 · log 10 1 − (ˆ) .
In a paralog-specifc VCF file, ParascopyVC outputs variants once for each repeat copy. For each sample and each repeat copy , ParascopyVC finds the most probable marginal paralog- . Phred quality scores for the marginal paralog-specific genotypes are calculated in a similar manner to the aggregate genotype qualities: ( ( ) ) = −10 · log 10 1 − ( ( ) ) . In the paralog-specific output VCF file, ParascopyVC uses the same formula to calculate variant qualities for both sequence variants and PSVs: ( ) ( ) = −10 ∈ log 10 ( * ), where * is the reference paralog-specific genotype on the repeat copy (consisting entirely of the reference allele * ). In other words, ParascopyVC calculates the Phred quality score of the probability that all samples exhibit the reference paralog-specific genotype on the repeat copy .